NRG-GY003 WES Analysis

Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors

Test
Author
Affiliation

Kelsey Monson, PhD, MS

Icahn School of Medicine at Mount Sinai

Published

July 31, 2025

Keywords

R, Data Analysis, Data Viz

Intro

Variant Filtering Strategy

I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:

Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.

Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:

  • First find consensus somatic calls:

    • Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
    • Include variants annotated with PASS
      • PASS: indicates the variant passed all quality filters applied by the variant caller

      • Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters

    • Further refine PASS variants
      • Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors

      • Allele frequency 0.05 < AF < 0.95:

        • >0.05: also helps avoid random sequencing errors/miss-called variants

        • <0.95: helps avoid germline contamination

  • Identify rare pathogenic germline variants:

    • Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)

    • Identify those most likely to be pathogenic

      • Limit to protein-coding variants (drop all intergenic and non-coding variants)

      • Must be annotated with HIGH impact

    • Eliminate common variants

      • “Rare” variants are typically those present in <1% of the population

      • Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.

      • I revised the script to set the threshold to 1% and will re-generate the MAF file eventually

      • For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.

    • Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)

  • Generated consensus MAF file

    • The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants

    • These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs

    • I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.

This consensus MAF is what is used for the downstream analysis presented here.

Loading in the data

Packages Used
# Load packages

library(maftools) # For majority of maf file analysis
library(mclust)
library("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)
library(NMF) # For signature calculation
library(pheatmap) # For nice heatmaps
library(ghibli) # For pretty colors

Load in raw MAF file and the clinical data

laml = read.maf(maf="input/merged_consensus_all_samples_germline.maf", 
                clinicalData ="input/NRG-GY00_laml_annot_2.csv",
                rmFlags = TRUE # "FLAGS, frequently mutated genes in public exomes"
                )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668 
-Summarizing
-Processing clinical data
-Finished in 3.380s elapsed (3.170s cpu) 
Tip

Setting rmFlags = TRUE removes frequently mutated genes in public exomes

Details on these genes can be found here.

Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)
laml@clinical.data$response <- factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))
#table(laml@clinical.data$response)

# Detailed response including durable and progressive SD
laml@clinical.data$response2 <- factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))
# table(laml@clinical.data$response2)

# Response by treatment
laml@clinical.data$response_tx <- factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))
#table(laml@clinical.data$response_tx)

# Race (White, Non-White, and Not Reported)
laml@clinical.data$race2 <- factor(laml@clinical.data$race2, levels=c("W","NW","NR"))
#table(laml@clinical.data$race2)

# Site (dichotomous)
laml@clinical.data$Site2 <- factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))
#table(laml@clinical.data$Site2)

# Primary/Met
laml@clinical.data$Primary_Met <- factor(laml@clinical.data$Primary_Met, levels=c("P","M"))
#table(laml@clinical.data$Primary_Met)

Subsetting datasets

We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.

Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.

Only tumor samples

I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.

laml_tum = subsetMaf(maf = laml, clinQuery = "Tissue %in% 'Tu'")
-subsetting by clinical data..
--86 samples meet the clinical query
-Processing clinical data
laml_tum = subsetMaf(maf = laml_tum, query = "vcf_id %in% c('SOMATIC','GERMLINE_PATHOGENIC') ")
-Processing clinical data

Other subsets

We can also subset by tumor histology, treatment, and response.

## Comparing Serous vs others
laml_ser = subsetMaf(maf = laml_tum, clinQuery = "celltype %in% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--71 samples meet the clinical query
-Processing clinical data
# Subsetting the other dataset to "not serous"
`%ni%` <- Negate(`%in%`)
laml_other = subsetMaf(maf = laml_tum, clinQuery = "celltype %ni% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--15 samples meet the clinical query
-Processing clinical data
## Comparing CB vs NCB
laml_CB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'CB'")
-subsetting by clinical data..
--29 samples meet the clinical query
-Processing clinical data
laml_NCB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'NCB'")
-subsetting by clinical data..
--57 samples meet the clinical query
-Processing clinical data
## Subsetting by treatment to see if there are differences
### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.
laml_nivo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
laml_combo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Combo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data

View the MAF file summary

Here is a basic summary of the MAF file

laml_tum
An object of class  MAF 
                        ID summary    Mean Median
                    <char>  <char>   <num>  <num>
 1:             NCBI_Build  GRCh38      NA     NA
 2:                 Center       .      NA     NA
 3:                Samples      86      NA     NA
 4:                 nGenes    6897      NA     NA
 5:        Frame_Shift_Del     234   2.721    2.0
 6:        Frame_Shift_Ins      43   0.500    0.0
 7:           In_Frame_Del      83   0.965    0.0
 8:           In_Frame_Ins       6   0.070    0.0
 9:      Missense_Mutation    9193 106.895   81.0
10:      Nonsense_Mutation     524   6.093    4.0
11:       Nonstop_Mutation      13   0.151    0.0
12:            Splice_Site     505   5.872    3.0
13: Translation_Start_Site      19   0.221    0.0
14:                  total   10620 123.488   92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.

# #Shows sample summary.
# getSampleSummary(laml_tum)

# #Shows gene summary.
# getGeneSummary(laml_tum)

# #shows clinical data associated with samples
# getClinicalData(laml_tum)

# #Shows all fields in MAF
# getFields(laml_tum)

Visualization

Summarize the maf file

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml_tum, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

#Use mafbarplot for a minimal barplot of mutated genes.
mafbarplot(maf =  laml_tum)

Summarize the other subsets

Here are the mutation distributions for the other subsets.

As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).

By Histology

Serous
## Serous 
plotmafSummary(maf = laml_ser, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_ser)

Non-Serous
## Non-serous
plotmafSummary(maf = laml_other, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_other)

By Response

CB
## CB
plotmafSummary(maf = laml_CB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_CB)

NCB
## NCB
plotmafSummary(maf = laml_NCB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_NCB)

By Treatment

Nivo
## Nivo
plotmafSummary(maf = laml_nivo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_nivo)

Combo
## Combo
plotmafSummary(maf = laml_combo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_combo)

Oncoprints

Oncoprints (or “oncoplots” as the wrapper function in maftools is called) summarize complex genomic features of a given dataset.

#oncoplot for top ten mutated genes.
oncoplot(maf = laml_tum, top = 10)

These are the key features and how they are interpreted:

  • Columns represent individual patients.
    • Reading column-wise, you can see the collection of alterations in a patient’s tumor for the given set of genes.
    • The bar plot on the top is a count of tumor mutation burden per patient, color-coded by mutation type.
  • Rows are altered genes.
    • Alterations are color-coded to indicate their type (e.g. mutation, deletion, insertion)
    • Genes can be further annotated with key features (e.g. high impact/likely pathogenic mutations, germline variants, etc.)
  • The bar plot on the right summarizes alterations in a given gene for the entire dataset.
  • Many more features and annotations can be added to further characterize the dataset.
# Highlight by specific attribute in MAF file
oncoplot(maf = laml_tum, 
         additionalFeature = c("IMPACT", "HIGH"))

# Add transitions/transversions  
oncoplot(maf = laml_tum, draw_titv = TRUE)

# cBioPortal alteration nomenclature 
oncoplot(maf = laml_tum, cBioPortal = TRUE)

Oncoprints by clinical data

## Primary vs metastatic sites
oncoplot(maf = laml_tum, clinicalFeatures = 'Primary_Met', sortByAnnotation = TRUE)

## Cell type
oncoplot(maf = laml_tum, clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

## ICI response
oncoplot(maf = laml_tum, clinicalFeatures = c('responseCB','response2'), sortByAnnotation = TRUE)

## ICI response and treatment
oncoplot(maf = laml_tum, clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

## Mutational signature
oncoplot(maf = laml_tum, clinicalFeatures = 'Signature', sortByAnnotation = TRUE)

## Race
oncoplot(maf = laml_tum, clinicalFeatures = 'race', sortByAnnotation = TRUE)

## Site
oncoplot(maf = laml_tum, clinicalFeatures = c('Site2','Site'), sortByAnnotation = TRUE)

Oncogenic signaling pathways

sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.

# Plot genes in 2 top oncogenic signaling pathways
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 2)

# Collapse to just plot the pathways, now top 5
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

Pathways by Response

# Response
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# Response by treatment
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

Pathways by Histology

Since these are the top pathways for the whole cohort, they may be masking histologically-specific pathways.

This is where it is useful to subset the data. Here I am plotting the top pathways for serous and all non-serous patients separately.

# Serous
oncoplot(maf = laml_ser, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous 
oncoplot(maf = laml_other, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Biological processes of known tumor drivers

smgbp plots enrichment for pan-cancer significantly mutated genes, classified by biological processes, per Bailey et al.

Here are the top 2 biological processes of known drivers

# Note that I have highlighted the BRCA1/2 germline variants
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2, 
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by Response & Histology

# Response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Detailed response (note that none of the germline variants had CR or PR)
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response2', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Response by treatment 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by mutational signature*

*Will need to return to this, but it seems like sig1 is more BRCA1 while sig2 is more BRCA2.

oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'Signature', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Top pathways

Now collapsing the plots to show only the pathways

# Top 5 pathways 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

# Top 10 pathways*
# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE)

# By response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# By response and treatment
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

# By histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

Top pathways by histology

As with the oncogenic signaling pathways, I am plotting these separately for serous and non-serous.

# Serous
oncoplot(maf = laml_ser, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous
oncoplot(maf = laml_other, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10,
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Additional oncoprints with more complex annotations

Code
## Color coding
ghibli_colors <- ghibli_palette("PonyoMedium", type = "discrete")
# ghibli_colors
respcolors <- c("#278B9AFF","#E75B64FF")
names(respcolors) = c("CB","NCB")
fabcolors = RColorBrewer::brewer.pal(n = 5,name = 'Spectral')
names(fabcolors) = c("CR", "PR", "SD_CB", "SD_NCB", "PD")
txcolors = c("#BEAED4","#7FC97F")
names(txcolors) = c("Nivo","Combo")
cellcolors = RColorBrewer::brewer.pal(n = 6,name = 'PRGn')
names(cellcolors) = c("Serous_Adenocarcinoma","Endometrioid_Adenocarcinoma","Adenocarcinoma",
                      "Clear_Cell","Mixed_Epithelial","Undifferentiated_Carcinoma")


anno_cols = list(tx = txcolors, response2 = fabcolors, Survtime = "Blues", celltype = cellcolors)
anno_cols2 = list(tx = txcolors, responseCB = respcolors, Survtime = "Blues", celltype = cellcolors)
#print(anno_cols2)
oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('response2','tx',  'Survtime','celltype'),
  sortByAnnotation = TRUE,
  annotationColor = anno_cols
)

oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('responseCB','tx','Survtime','celltype'),
  sortByAnnotation = TRUE,
  annotationColor = anno_cols2,
  pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 2,
  additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC")
)